class: center, middle, inverse, title-slide

.title[
# Data Science Practice
]
.subtitle[
## STATS 369 Coursebook: Week 2
]

---
exclude: true

---
class: center, middle

# Lecture 4

---
exclude: true

---

## Plan for this week

* **<mark>[L04] Data Visualisation with `{ggplot2}`</mark>**
    + Motivation
    + `{ggplot2}` package
    + Aesthetic attributes
    + Geometric objects
    + Facets
* **[L05] Examples**
* **[L06] General Comments**
    + Which (common) plot to use?
    + What to pay attention to?
    + Further comments

---
class: inverse, center, middle

# Motivation

---

## Why are graphs important?

.pull-left[
* *First impressions* matter -- a graph is visually stimulating.
* *Efficiency in exploring the data* -- Always visualise your data sets before creating any models!
* *Effective communication* -- 'A picture is worth a thousand words'.
* Sometimes, summary statistics are just not enough -- see examples [here](https://www.autodeskresearch.com/publications/samestats).
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Why are graphs important?

.pull-left[
* *First impressions* matter -- a graph is visually stimulating.
* *Efficiency in exploring the data* -- Always visualise your data sets before creating any models!
* *Effective communication* -- 'A picture is worth a thousand words'.
* Sometimes, summary statistics are just not enough -- see examples [here](https://www.autodeskresearch.com/publications/samestats).
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-1A-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## ggplot

The name `ggplot` comes from the book *The Grammar of Graphics* by Leland Wilkinson (2005) (ref: ISBN 978-0-387-98774-3).

A grammar of graphics is a <u>framework</u> that allows a <u>structured</u> and <u>layered</u> approach to constructing graphics.
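
To make the 'layered' idea concrete, here is a minimal sketch using the `{ggplot2}` package and R's built-in `airquality` data. The choice of variables (`Temp`, `Wind`) and of layers is purely illustrative; the object names `base`, `with_points` and `with_trend` are made up for this example.

```r
library(ggplot2)

# Start with data and aesthetic mappings only: no geometry yet,
# so printing `base` would draw empty axes.
base <- ggplot(airquality, aes(x = Temp, y = Wind))

# Add a first layer: one point per observation.
with_points <- base + geom_point()

# Add a second layer on top: a fitted straight line.
with_trend <- with_points + geom_smooth(method = "lm", formula = y ~ x)

# Each `+` adds one layer to the plot object.
length(base$layers)
length(with_trend$layers)
```

Printing `with_trend` draws both layers on the same axes; each intermediate object is itself a complete plot specification.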

#### Components of a graph

<img src="data:image/png;base64,#./img/components_of_ggplot.jpg" width="65%" style="display: block; margin: auto;" />

[Image source](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149)

---
class: inverse, middle, center

# Data Visualisation with `{ggplot2}`

---

## The `{ggplot2}` package

An R visualisation package, developed by Hadley Wickham (2007), that adapts and implements the grammar of graphics.

> 'ggplot2 (Wickham 2009) builds on Wilkinson’s grammar by focussing on the primacy of layers and adapting it for use in R. In brief, the grammar tells us that a graphic maps the data to the **aesthetic attributes** (colour, shape, size) of **geometric objects** (points, lines, bars). The plot may also include **statistical transformations** of the data and information about the plot’s **coordinate system**. **Facetting** can be used to plot for different subsets of the data. The combination of these independent components are what make up a graphic.' -- Hadley Wickham, [*ggplot2*](https://ggplot2-book.org/)

<br><br>

Check out the 'R Graph Gallery' [here](https://www.r-graph-gallery.com/).

---

## Aesthetic attributes

Which attributes will be mapped onto the x-axis and y-axis?

.pull-left[
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p0-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Scatter plot

Now we can actually add some points.

.pull-left[
```r
# Assumed to be loaded for all code in these slides:
# provides %>% (via dplyr) and ggplot2
library(tidyverse)

airquality %>%
  ggplot(aes(x = Solar.R, y = Ozone)) +
* geom_point()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Colours

How about colouring the points by another (factor) variable?

.pull-left[
```r
airquality %>%
  ggplot(aes(x = Solar.R, y = Ozone,
*            color = factor(Month))) +
  geom_point()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p3-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Geometric objects

We have seen points; what else is there?

.pull-left[
```r
airquality %>%
  ggplot(aes(x = Solar.R, y = Ozone)) +
  geom_point() +
* geom_smooth()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p4-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Geometric objects

.pull-left[
```r
# geom_hex() requires the {hexbin} package to be installed
airquality %>%
  ggplot(aes(x = Solar.R, y = Ozone)) +
* geom_hex()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p5-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Geometric objects

.pull-left[
```r
airquality %>%
  ggplot(aes(x = factor(Month), y = Ozone)) +
* geom_boxplot()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p6-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Facets

Faceting divides a data set into subgroups and plots each group separately.

It is useful when the relationship goes beyond 2D -- you can explore the relationship between two variables conditional on other variable(s).

There are two common types of faceting in `{ggplot2}`: `facet_grid` and `facet_wrap`.

.pull-left[
```r
airquality %>%
  ggplot(aes(x = Solar.R, y = Ozone)) +
  geom_point() +
* facet_wrap(~Month, nrow = 2)
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p7-1.png" width="75%" style="display: block; margin: auto;" />
]

---

## Incorporating data processing

.pull-left[
```r
airquality %>%
  na.omit() %>%
  # cut each variable at its quartiles; include.lowest = TRUE
  # keeps the minimum value in the first group
  mutate(TempGp = cut(Temp,
                      breaks = quantile(Temp, (0:4)/4),
*                     include.lowest = TRUE)) %>% # use %>%
  mutate(WindGp = cut(Wind,
                      breaks = quantile(Wind, (0:4)/4),
                      include.lowest = TRUE)) %>%
  ggplot(aes(x = Solar.R, y = Ozone)) +
* geom_point() + # use +
  facet_grid(WindGp ~ TempGp)
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L01-p8-1.png" width="100%" style="display: block; margin: auto;" />
]

<!-- ### EOF ### -->

---
exclude: true

---
class: inverse, center, middle

# The NZ Vehicle Registration Data

---

## Car registration open data

[NZ vehicle registration open data](https://nzta.govt.nz/resources/new-zealand-motor-vehicle-register-statistics/new-zealand-vehicle-fleet-open-data-sets/) provides a snapshot of the vehicle fleet currently registered in NZ. The 2019 data set has 206,099 rows and 34 columns.

```r
cars.df <- read_csv("datasets/VehicleYear-2019.csv")
# glimpse(cars.df)
dim(cars.df) # dimension of the data frame
head(names(cars.df), 20) # the first 20 column names of cars.df
```

```
## [1] 206099     34
```

```
##  [1] "ALTERNATIVE_MOTIVE_POWER"    "BASIC_COLOUR"
##  [3] "BODY_TYPE"                   "CC_RATING"
##  [5] "CHASSIS7"                    "CLASS"
##  [7] "ENGINE_NUMBER"               "FIRST_NZ_REGISTRATION_YEAR"
##  [9] "FIRST_NZ_REGISTRATION_MONTH" "GROSS_VEHICLE_MASS"
## [11] "HEIGHT"                      "IMPORT_STATUS"
## [13] "INDUSTRY_CLASS"              "INDUSTRY_MODEL_CODE"
## [15] "MAKE"                        "MODEL"
## [17] "MOTIVE_POWER"                "MVMA_MODEL_CODE"
## [19] "NUMBER_OF_AXLES"             "NUMBER_OF_SEATS"
```

---

## Distribution of car weight

.pull-left[
```r
cars.df %>%
  ggplot(aes(x = GROSS_VEHICLE_MASS)) +
  geom_histogram()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p11-1.png" width="100%" style="display: block; margin: auto;" />

<!-- Q: what do you see from the plot? -->
]

---

## Filter out zero-weight cars

.pull-left[
```r
cars.df %>%
* filter(GROSS_VEHICLE_MASS > 0,
*        POWER_RATING > 0) %>%
  ggplot(aes(x = POWER_RATING,
             y = GROSS_VEHICLE_MASS)) +
  geom_point()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p12-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Better scatter plot

.pull-left[
```r
cars.df %>%
  filter(GROSS_VEHICLE_MASS > 0,
         POWER_RATING > 0) %>%
  ggplot(aes(x = POWER_RATING,
             y = GROSS_VEHICLE_MASS)) +
  # transparent points make dense regions visible
* geom_point(alpha = 0.05)
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p13-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Try hex(bin) plot

.pull-left[
```r
cars.df %>%
  filter(GROSS_VEHICLE_MASS > 0,
         POWER_RATING > 0) %>%
  ggplot(aes(x = POWER_RATING,
             y = GROSS_VEHICLE_MASS)) +
* geom_hex()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p14-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Further exploration

.pull-left[
```r
cars.df %>%
  filter(GROSS_VEHICLE_MASS > 0,
         POWER_RATING > 0,
         MAKE == 'TOYOTA') %>%
* ggplot(aes(x = FIRST_NZ_REGISTRATION_YEAR,
*            y = GROSS_VEHICLE_MASS)) +
  geom_point()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p15-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Add 'jitter'

.pull-left[
```r
cars.df %>%
  filter(GROSS_VEHICLE_MASS > 0,
         POWER_RATING > 0,
         MAKE == 'TOYOTA') %>%
  ggplot(aes(x = FIRST_NZ_REGISTRATION_YEAR,
             y = GROSS_VEHICLE_MASS)) +
  # small random noise separates overlapping points
* geom_jitter()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p16-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Scales, labels and theme

If you are making plots for others, it is a good idea to make them clear and readable.

.pull-left[
```r
p <- cars.df %>%
  filter(GROSS_VEHICLE_MASS > 0,
         POWER_RATING > 0) %>%
  ggplot(aes(x = POWER_RATING,
             y = GROSS_VEHICLE_MASS)) +
  geom_point(alpha = 0.05) +
* labs(title = "Engine power vs. Car weight",
*      x = "Power rating (kW)",
*      y = "Vehicle mass (kg)",
*      caption = "Data from nzta.govt.nz")
p
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p17-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Get the scales and coordinates right, and try a different theme

.pull-left[
```r
p +
  # scale limits drop observations outside the range
* scale_y_continuous(
    limits = c(0, 3500),
    breaks = seq(0, 3500, by = 500)) +
  # coord_cartesian() zooms in without dropping data
  coord_cartesian(xlim = c(0, 500)) +
* theme_minimal()
```
]

.pull-right[
<img src="data:image/png;base64,#img/W02L02-p18-1.png" width="100%" style="display: block; margin: auto;" />
]

<!-- ### EOF ### -->

---
exclude: true

---

## Which (common) plot to use?

#### Single Variable (univariate)

Type|(Common) Plot to use|Features to Pay Attention to
----|--------------------|----------------------------
Quantitative (e.g. a variable of measurement)|dot plot/strip chart, histogram, density plot, box plot|shape, peaks, center, variability, outliers.
Qualitative (e.g. count of a grouping variable)|bar plot, pie chart, table of counts|majority/minority group, gaps in group counts.

<br>

---

## Which (common) plot to use?

#### Two Variables (bivariate)

Type|(Common) Plot to use|Features to Pay Attention to
----|--------------------|----------------------------
Quantitative vs. Qualitative|side-by-side histogram/density/box plot|compare shapes, centers, variability; outliers from individual group.
Quantitative vs. Quantitative|scatter plot, line plot|shapes, peaks, center, variability, outliers, correlation, grouping of observations, seasonal variation (for time series).
Qualitative vs. Qualitative|faceted bar plot, 2-way table of counts, pie chart (?)|compare group counts, distributions and gaps.

NB: Plots of more than two variables can often be achieved by 'reducing' them to variations of bivariate plots.

---

## General Plotting Advice

* Use colours, shapes etc., but keep things **balanced**.
* Keep the focus -- produce a plot with a clear message in mind.
* Be aware of scales, labels and hierarchy.
* Leave some white space.
* ...

---

## General Plotting Advice

* Avoid pie charts (?!)

> "Avoid pie-charts. Especially 3d pie-charts. Especially 3d pie-charts with exploding wedges. I promise all my students an instant fail if I ever see anything so appalling." - Rob J Hyndman, from ["Twenty rules for good graphics"](https://robjhyndman.com/hyndsight/graphics/)

* Sometimes it helps to produce multiple (types of) plots of the same data to reveal the real pattern. For example...

---

...For example, what can you see from the boxplot below?

<img src="data:image/png;base64,#img/W02L03-p1-1.png" width="75%" style="display: block; margin: auto;" />

---

But if we look at the density plot...

<img src="data:image/png;base64,#img/W02L03-p2-1.png" width="75%" style="display: block; margin: auto;" />

---

## Charts and accessibility

While charts are very much a visual medium, we can improve their accessibility by including 'alternative text', often known as 'alt text'.

### 📚 To read

Cesal, A. (2020). *Writing Alt Text for Data Visualization*. Nightingale. https://nightingaledvs.com/writing-alt-text-for-data-visualization/

<!-- ### EOF ### -->
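
As one possible way to attach alt text in R, recent versions of `{ggplot2}` (3.3.4 or later, as far as we are aware) accept an `alt` label in `labs()`, which can be read back with `get_alt_text()`. The sketch below uses the built-in `airquality` data; the wording of the alt text is only an illustration, not a recommended template.

```r
library(ggplot2)

p <- ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
  geom_point(na.rm = TRUE) +  # silently drop rows with missing values
  labs(
    title = "Ozone vs. solar radiation",
    # Illustrative alt text: says what kind of chart it is and what it plots
    alt = "Scatter plot of daily ozone against solar radiation from the airquality data set."
  )

# Retrieve the alt text stored in the plot object
get_alt_text(p)
```

In R Markdown documents, the `{knitr}` chunk option `fig.alt` is another route to the same goal: it writes the given text into the `alt` attribute of the rendered image.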